Skip to content

Update to use Prometheus Pusher context manager, #129#132

Closed
voetberg wants to merge 1 commit into
rucio:masterfrom
voetberg:common_context_manager_update
Closed

Update to use Prometheus Pusher context manager, #129#132
voetberg wants to merge 1 commit into
rucio:masterfrom
voetberg:common_context_manager_update

Conversation

@voetberg
Copy link
Copy Markdown
Contributor

Update some of the common probes to use utils.common.PrometheusPusher. Uses #89 as a basis, but swaps out the prometheus_client context manager. Includes some metric name changes to match file names, and printing out some error info for debugging.

CC: @ericvaandering

@voetberg voetberg force-pushed the common_context_manager_update branch from ef2d176 to 072758c Compare February 15, 2024 16:30
Comment thread common/check_expired_dids
result = query.scalar() or 0
# Possible check against a threshold. If result > max_value then sys.exit(CRITICAL)

manager.gauge('expired_dids.total',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does change the metric value. IIRC, ATLAS is ok with this in theory as we wanted to make things more sensical. But I'd like @dchristidis thoughts before we start using this for CMS.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that should be generally acceptable. But please add a comprehensive list of such changes in the commit message.

Comment thread common/check_fts_backlog Outdated

# Add to metrics
backlog_count = summary['submitted'] + summary['active'] + summary['staging'] + summary['started']
manager.gauge(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes names too, but in a slightly more subtle way. However, I think the variable things should always come at the end of the statsd metric.

On a style note, I'd prefer combining lines 138 and 139. You could also split that long line (140) just before the first closing paren and it would look nice.

Comment thread common/check_fts_backlog Outdated
Comment thread common/check_fts_backlog Outdated
print(f"Messages in queue: {message_count}")

manager.gauge(
"messages_to_submit.queues.messages",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again this changes the metric value. The job label in prometheus will say where this comes from, but it still might be a good idea.

Comment thread common/check_new_dids
.filter(models.DataIdentifier.is_new.isnot(None)))
result = query.scalar() or 0

manager.gauge("new_dids").set(result)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a new metric name

Comment thread common/check_stuck_rules Outdated
result = session.execute(sql_text(count_sql)).fetchone()[0]

with PrometheusPusher() as manager:
manager.gauge("unevaluated_dids").set(result)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a change

Comment thread common/check_unlocked_replicas Outdated
Comment thread common/check_updated_dids
push_to_gateway(server.strip(), job='check_updated_dids', registry=registry)
except:
continue
manager.gauge('updated_dids').set(result)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another change

@voetberg voetberg force-pushed the common_context_manager_update branch 2 times, most recently from 7992510 to 174bfd1 Compare February 20, 2024 22:06
…ge uniformly rucio#129

Changes to metric names include:
                    * undertaker_expired_dids -> expired_dids.total
                    * fts3.{hostname}.submitted -> fts_backlog.submitted.{hostname}
                    * hermes_queues_messages.queues.messages -> messages_to_submit.queues.messages
                    * transmogrifier_new_dids -> new_dids
                    * judge_stuck_rules_without_missing_source_replica -> stuck_rules.{source_status} (source_status = [without_missing_source_replica, with_missing_source_replica])
                    * check_transfer_queues_status -> transfer_queues_status
                    * judge.waiting_dids -> unevaluated_dids
                    * reaper.unlocked_replicas -> unlocked_replicas.{replica_status} (replica_status = [expired, unlocked])
                    * judge.updated_dids -> updated_dids
Comment thread common/check_fts_backlog
print(f'Busy channels (>{busylimit} submitted):')
for bc in busy_channels:
activities_str = ", ".join([(f"{key}: {val}") for key, val in bc['activities'].items()])
print(f'{bc['src']} to {bc['dst']} : {bc['submitted']} submitted jobs {activities_str}')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not compile. Did you mean to use double quotes here?
Also at other places for fstrings.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a linter got a little excited and undid my work - thanks for catching it

@dchristidis
Copy link
Copy Markdown
Contributor

@ericvaandering and @dynamic-entropy, please resolve the conversations that you believe have been addressed to your satisfaction. I will begin my review then.

@voetberg
Copy link
Copy Markdown
Contributor Author

voetberg commented Aug 5, 2024

@/ericvaandering and @/dynamic-entropy, please resolve the conversations that you believe have been addressed to your satisfaction. I will begin my review then.

@dchristidis I think I'll just close this PR and restart each probe as its own PR. All our sql2.0 guidelines have been documented since this PR was put up, so a lot of things are (obviously) wrong here.

@voetberg voetberg closed this Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants